feat: Implement ZIP64 support in pzip for large file handling#345
Conversation
- Introduced `forceFlush` method to ensure all buffered data is output immediately.
- Updated `flushOutput` method to flush only when the buffer exceeds 256KB, reducing system calls.
- Modified `storeFast` and `close` methods to use `forceFlush` for final data output.

Log: Enhance data flushing mechanisms in FlateWriter
- Added constants for ZIP64 thresholds and version requirements.
- Enhanced ZipFileHeader with an `isZip64()` method to determine ZIP64 status.
- Updated `writeLocalFileHeader` and `writeDataDescriptor` methods to handle ZIP64 sizes.
- Modified `writeCentralDirectory` and `writeEndOfCentralDirectory` to accommodate ZIP64 records and offsets.
- Introduced ZIP64 extra fields for central directory entries and end-of-central-directory records.

Log: Improve ZIP file handling for large files with ZIP64 support
Reviewer's Guide

Adds ZIP64 support to the pzip ZipWriter so it can handle large files and large central directories, and adjusts the deflate writer buffering to flush less often while still guaranteeing all data is flushed on close.

Sequence diagram for ZipWriter close with ZIP64 central directory and EOCD handling

sequenceDiagram
actor Client
participant ZipWriter
participant File as file_
Client->>ZipWriter: close()
activate ZipWriter
ZipWriter->>ZipWriter: centralDirOffset = currentOffset_
ZipWriter->>ZipWriter: writeCentralDirectory()
activate ZipWriter
loop for each entry in centralDir_
ZipWriter->>ZipWriter: needZip64 = header.isZip64() || localHeaderOffset >= ZIP_UINT32_MAX
alt needZip64
ZipWriter->>ZipWriter: write central dir header with ZIP_UINT32_MAX sizes
ZipWriter->>ZipWriter: build zip64Extra (sizes, offset)
else not needZip64
ZipWriter->>ZipWriter: write central dir header with 32-bit sizes
end
ZipWriter->>File: write fixed central dir fields
ZipWriter->>File: write filename
ZipWriter->>File: write original extra
alt needZip64
ZipWriter->>File: write zip64Extra
end
ZipWriter->>ZipWriter: update currentOffset_
end
deactivate ZipWriter
ZipWriter->>ZipWriter: centralDirSize = currentOffset_ - centralDirOffset
ZipWriter->>ZipWriter: writeEndOfCentralDirectory(centralDirOffset, centralDirSize)
activate ZipWriter
ZipWriter->>ZipWriter: records = centralDir_.size()
ZipWriter->>ZipWriter: size = centralDirSize
ZipWriter->>ZipWriter: offset = centralDirOffset
ZipWriter->>ZipWriter: needZip64 = (records >= ZIP_UINT16_MAX || size >= ZIP_UINT32_MAX || offset >= ZIP_UINT32_MAX)
alt needZip64
ZipWriter->>File: write ZIP64 EOCD record
ZipWriter->>File: write ZIP64 EOCD locator
ZipWriter->>ZipWriter: records = ZIP_UINT16_MAX
ZipWriter->>ZipWriter: size = ZIP_UINT32_MAX
ZipWriter->>ZipWriter: offset = ZIP_UINT32_MAX
ZipWriter->>ZipWriter: update currentOffset_
end
ZipWriter->>File: write classic EOCD with records, size, offset
ZipWriter->>File: write comment
deactivate ZipWriter
ZipWriter->>File: close()
deactivate ZipWriter
ZipWriter-->>Client: close() result
Sequence diagram for FlateWriter output buffering and flushing

sequenceDiagram
actor Client
participant FlateWriter
participant BitWriter as writer_
participant Output as output_
Client->>FlateWriter: fillBlock(data, size)
activate FlateWriter
FlateWriter->>BitWriter: write compressed tokens
FlateWriter->>FlateWriter: flushOutput()
activate FlateWriter
FlateWriter->>BitWriter: buf = writer_.data()
alt buf.size() >= 256KB and output_ set
FlateWriter->>Output: output_(buf.data(), buf.size())
FlateWriter->>BitWriter: buf.clear()
else buffer below threshold
FlateWriter->>FlateWriter: do not flush
end
deactivate FlateWriter
deactivate FlateWriter
Client->>FlateWriter: close()
activate FlateWriter
FlateWriter->>FlateWriter: if already closed return
FlateWriter->>FlateWriter: finish pending compression
FlateWriter->>BitWriter: writer_->flush()
FlateWriter->>FlateWriter: forceFlush()
activate FlateWriter
FlateWriter->>BitWriter: buf = writer_.data()
alt buf not empty and output_ set
FlateWriter->>Output: output_(buf.data(), buf.size())
FlateWriter->>BitWriter: buf.clear()
else no remaining data
FlateWriter->>FlateWriter: nothing to flush
end
deactivate FlateWriter
FlateWriter->>FlateWriter: mark closed
deactivate FlateWriter
FlateWriter-->>Client: close() result
Class diagram for updated ZIP64-related structures and writers

classDiagram
class ZipFileHeader {
+string name
+uint16_t versionMadeBy
+uint16_t versionNeeded
+uint16_t flags
+uint16_t compressionMethod
+time_t modTime
+uint32_t crc32
+uint64_t compressedSize
+uint64_t uncompressedSize
+uint16_t internalAttr
+uint32_t externalAttr
+bool isDirectory()
+bool isZip64()
}
class ZipWriter {
-std::ofstream file_
-uint64_t currentOffset_
-std::string comment_
-std::vector~CentralDirEntry~ centralDir_
+Error writeLocalFileHeader(ZipFileHeader header)
+Error writeDataDescriptor(ZipFileHeader header)
+Error writeCentralDirectory()
+Error writeEndOfCentralDirectory(uint64_t centralDirOffset, uint64_t centralDirSize)
+Error close()
+static void timeToDos(time_t t, uint16_t date, uint16_t dosTime)
}
class FlateWriter {
-std::unique_ptr~BitWriter~ writer_
-std::function~void(const uint8_t* data, size_t size)~ output_
-TokenBuffer tokens_
-uint32_t windowEnd_
-bool useL1_
+size_t fillBlock(const uint8_t* data, size_t size)
+void storeFast()
+void flushOutput()
+void forceFlush()
+void close()
+FastGen* encoder()
}
ZipWriter --> ZipFileHeader : uses
FlateWriter --> ZipWriter : feeds compressed data
Hey - I've found 4 issues, and left some high level feedback:
- In writeCentralDirectory, `needZip64` is computed with `entry.localHeaderOffset >= ZIP_UINT32_MAX`, but the actual central directory local header offset is only clamped when `entry.localHeaderOffset > ZIP_UINT32_MAX`; to avoid inconsistent signalling of ZIP64 usage around the exact 0xFFFFFFFF boundary, consider using the same `>= ZIP_UINT32_MAX` condition in both places.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In writeCentralDirectory, `needZip64` is computed with `entry.localHeaderOffset >= ZIP_UINT32_MAX` but the actual central directory local header offset is only clamped when `entry.localHeaderOffset > ZIP_UINT32_MAX`; to avoid inconsistent signalling of ZIP64 usage around the exact 0xFFFFFFFF boundary, consider using the same `>= ZIP_UINT32_MAX` condition in both places.
## Individual Comments
### Comment 1
<location> `3rdparty/pzip/src/zip_writer.cpp:393-379` </location>
<code_context>
- // Extra field length
- write16(static_cast<uint16_t>(h.extra.size()));
+ // Build the ZIP64 extra field
+ std::vector<uint8_t> zip64Extra;
+ if (needZip64) {
+ // 28 bytes: 2x uint16 + 3x uint64
+ // Header ID
</code_context>
<issue_to_address>
**issue (bug_risk):** Building a new ZIP64 extra block may conflict with an existing ZIP64 extra in `h.extra`.
When `needZip64` is true, a new ZIP64 extra field is always appended without checking whether `h.extra` already includes one. If callers pre-populate a ZIP64 extra, this can create duplicate or conflicting entries. Consider either updating an existing ZIP64 extra if present, or clearly documenting that callers must not provide ZIP64 extra data and that this writer is the sole owner of it.
</issue_to_address>
### Comment 2
<location> `3rdparty/pzip/src/zip_writer.cpp:417-418` </location>
<code_context>
+ }
+ }
+
+ // Extra field length (original extra + ZIP64 extra)
+ size_t extraLen = h.extra.size() + zip64Extra.size();
+ write16(static_cast<uint16_t>(extraLen));
</code_context>
<issue_to_address>
**issue (bug_risk):** Potential truncation if combined extra field length exceeds `UINT16_MAX`.
`extraLen` is computed as `size_t` and then cast to `uint16_t` in `write16(static_cast<uint16_t>(extraLen));`. If `h.extra.size() + zip64Extra.size()` exceeds `0xFFFF`, this will be truncated and produce an invalid ZIP central directory entry. Please add a check for this condition and either fail with a clear error or otherwise handle the overflow explicitly.
</issue_to_address>
### Comment 3
<location> `3rdparty/pzip/src/fast_deflate.cpp:1174-1175` </location>
<code_context>
-// Flush data from writer_ to the output target
+// Flush data from writer_ to the output target (only when the buffer is large enough)
void FlateWriter::flushOutput() {
+ auto& buf = writer_->data();
+ // Only flush when the buffer exceeds 256KB, to reduce system calls
+ if (buf.size() >= 256 * 1024 && output_) {
</code_context>
<issue_to_address>
**question (bug_risk):** `flushOutput` now only flushes on large buffers, which changes its semantics and may surprise callers.
This optimization reduces syscalls but changes the meaning of "flush" and may increase peak memory or delay data delivery. Since `flushOutput` is used from `storeFast()` and `forceFlush()` is used from `close()`, this is likely intentional, but please (a) confirm no callers depend on `flushOutput` flushing immediately, or (b) consider renaming one of the methods to distinguish "soft" vs "hard" flush semantics more clearly.
</issue_to_address>
### Comment 4
<location> `3rdparty/pzip/src/zip_writer.cpp:197` </location>
<code_context>
}
Error ZipWriter::writeDataDescriptor(const ZipFileHeader& header) {
+ // Mirrors writeDataDescriptor in Go's archive/zip
+ // If isZip64(), use 24 bytes (4+4+8+8); otherwise use 16 bytes (4+4+4+4)
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting the repeated little-endian writing and ZIP64 record/extra-field construction into shared helper functions to simplify these ZIP writer methods.
You can reduce the added complexity without changing behavior by extracting the repetitive little‑endian writing and ZIP64 building logic into small helpers, then using them in `writeCentralDirectory` and `writeEndOfCentralDirectory`.
### 1. Centralize little‑endian writers
Right now, each of `writeLocalFileHeader`, `writeDataDescriptor`, `writeCentralDirectory`, and `writeEndOfCentralDirectory` defines its own `write16`/`write32`/`write64` lambdas.
You can keep all behavior and simplify by introducing reusable helpers:
```cpp
// In an internal namespace / anonymous namespace in the .cpp
inline void appendLE16(std::vector<uint8_t>& buf, uint16_t v) {
buf.push_back(static_cast<uint8_t>(v));
buf.push_back(static_cast<uint8_t>(v >> 8));
}
inline void appendLE32(std::vector<uint8_t>& buf, uint32_t v) {
appendLE16(buf, static_cast<uint16_t>(v));
appendLE16(buf, static_cast<uint16_t>(v >> 16));
}
inline void appendLE64(std::vector<uint8_t>& buf, uint64_t v) {
appendLE32(buf, static_cast<uint32_t>(v));
appendLE32(buf, static_cast<uint32_t>(v >> 32));
}
```
Then e.g. in `writeDataDescriptor`:
```cpp
Error ZipWriter::writeDataDescriptor(const ZipFileHeader& header) {
std::vector<uint8_t> buf;
buf.reserve(header.isZip64() ? 24 : 16);
appendLE32(buf, DATA_DESCRIPTOR_SIG);
appendLE32(buf, header.crc32);
if (header.isZip64()) {
appendLE64(buf, header.compressedSize);
appendLE64(buf, header.uncompressedSize);
} else {
appendLE32(buf, static_cast<uint32_t>(header.compressedSize));
appendLE32(buf, static_cast<uint32_t>(header.uncompressedSize));
}
file_.write(reinterpret_cast<const char*>(buf.data()), buf.size());
if (!file_.good()) {
return Error(ErrorCode::FILE_WRITE_ERROR, "Failed to write data descriptor");
}
currentOffset_ += buf.size();
return Error();
}
```
Same pattern can replace the lambdas in `writeLocalFileHeader`, `writeCentralDirectory`, and `writeEndOfCentralDirectory`.
### 2. Factor ZIP64 extra field construction
The ZIP64 extra field logic in `writeCentralDirectory` currently has three manual `for (int i = 0; i < 8; ++i)` loops. That can be encapsulated to make the loop more declarative:
```cpp
inline void buildZip64Extra(const ZipFileHeader& h,
uint64_t localHeaderOffset,
std::vector<uint8_t>& zip64Extra) {
zip64Extra.clear();
zip64Extra.reserve(28); // 2 + 2 + 3 * 8
// Header ID
appendLE16(zip64Extra, EXTRA_ID_ZIP64);
// Data size = 24 (3x uint64)
appendLE16(zip64Extra, 24);
appendLE64(zip64Extra, h.uncompressedSize);
appendLE64(zip64Extra, h.compressedSize);
appendLE64(zip64Extra, localHeaderOffset);
}
```
Then `writeCentralDirectory` becomes simpler:
```cpp
for (auto& entry : centralDir_) {
auto& h = entry.header;
buf.clear();
bool needZip64 = h.isZip64() || entry.localHeaderOffset >= ZIP_UINT32_MAX;
appendLE32(buf, CENTRAL_DIR_HEADER_SIG);
appendLE16(buf, h.versionMadeBy);
appendLE16(buf, needZip64 ? ZIP_VERSION_45 : h.versionNeeded);
appendLE16(buf, h.flags);
appendLE16(buf, h.method);
appendLE16(buf, h.modTime);
appendLE16(buf, h.modDate);
appendLE32(buf, h.crc32);
if (needZip64) {
appendLE32(buf, ZIP_UINT32_MAX);
appendLE32(buf, ZIP_UINT32_MAX);
} else {
appendLE32(buf, static_cast<uint32_t>(h.compressedSize));
appendLE32(buf, static_cast<uint32_t>(h.uncompressedSize));
}
appendLE16(buf, static_cast<uint16_t>(h.name.size()));
std::vector<uint8_t> zip64Extra;
if (needZip64) {
buildZip64Extra(h, entry.localHeaderOffset, zip64Extra);
}
size_t extraLen = h.extra.size() + zip64Extra.size();
appendLE16(buf, static_cast<uint16_t>(extraLen));
appendLE16(buf, 0); // comment length
appendLE16(buf, 0); // disk number
appendLE16(buf, 0); // internal attr
appendLE32(buf, h.externalAttr);
appendLE32(buf, entry.localHeaderOffset > ZIP_UINT32_MAX
? ZIP_UINT32_MAX
: static_cast<uint32_t>(entry.localHeaderOffset));
// fixed part
file_.write(reinterpret_cast<const char*>(buf.data()), buf.size());
currentOffset_ += buf.size();
// name + original extra + zip64 extra
file_.write(h.name.c_str(), h.name.size());
currentOffset_ += h.name.size();
if (!h.extra.empty()) {
file_.write(reinterpret_cast<const char*>(h.extra.data()), h.extra.size());
currentOffset_ += h.extra.size();
}
if (!zip64Extra.empty()) {
file_.write(reinterpret_cast<const char*>(zip64Extra.data()), zip64Extra.size());
currentOffset_ += zip64Extra.size();
}
}
```
This keeps the logic the same but removes the low‑level byte loops from the core flow.
### 3. Split EOCD writing into small helpers
`writeEndOfCentralDirectory` currently builds ZIP64 EOCD, ZIP64 locator, and classic EOCD in one function with local `write16`/`write32`/`write64`. You can keep the signature and behavior but factor out the record builders:
```cpp
inline void buildZip64Eocd(uint64_t records,
uint64_t size,
uint64_t offset,
std::vector<uint8_t>& buf) {
buf.clear();
buf.reserve(56); // directory64EndLen
appendLE32(buf, ZIP64_END_OF_CENTRAL_DIR_SIG);
appendLE64(buf, 44); // size of remaining record
appendLE16(buf, ZIP_VERSION_45); // version made by
appendLE16(buf, ZIP_VERSION_45); // version needed
appendLE32(buf, 0); // disk number
appendLE32(buf, 0); // disk with start
appendLE64(buf, records);
appendLE64(buf, records);
appendLE64(buf, size);
appendLE64(buf, offset);
}
inline void buildZip64Locator(uint64_t zip64EocdOffset,
std::vector<uint8_t>& buf) {
// append to existing buf
appendLE32(buf, ZIP64_END_OF_CENTRAL_DIR_LOCATOR_SIG);
appendLE32(buf, 0); // disk with start of zip64 EOCD
appendLE64(buf, zip64EocdOffset);
appendLE32(buf, 1); // total number of disks
}
inline void buildClassicEocd(uint64_t records,
uint64_t size,
uint64_t offset,
const std::string& comment,
std::vector<uint8_t>& buf) {
buf.clear();
buf.reserve(22 + comment.size());
appendLE32(buf, END_OF_CENTRAL_DIR_SIG);
appendLE16(buf, 0); // this disk
appendLE16(buf, 0); // disk with start
appendLE16(buf, static_cast<uint16_t>(records));
appendLE16(buf, static_cast<uint16_t>(records));
appendLE32(buf, static_cast<uint32_t>(size));
appendLE32(buf, static_cast<uint32_t>(offset));
appendLE16(buf, static_cast<uint16_t>(comment.size()));
}
```
Use them in `writeEndOfCentralDirectory`:
```cpp
Error ZipWriter::writeEndOfCentralDirectory(uint64_t centralDirOffset,
uint64_t centralDirSize) {
uint64_t records = centralDir_.size();
uint64_t size = centralDirSize;
uint64_t offset = centralDirOffset;
std::vector<uint8_t> buf;
bool needZip64 =
(records >= ZIP_UINT16_MAX ||
size >= ZIP_UINT32_MAX ||
offset >= ZIP_UINT32_MAX);
if (needZip64) {
buildZip64Eocd(records, size, offset, buf);
buildZip64Locator(currentOffset_, buf); // offset of zip64 EOCD
file_.write(reinterpret_cast<const char*>(buf.data()), buf.size());
currentOffset_ += buf.size();
records = ZIP_UINT16_MAX;
size = ZIP_UINT32_MAX;
offset = ZIP_UINT32_MAX;
}
buildClassicEocd(records, size, offset, comment_, buf);
file_.write(reinterpret_cast<const char*>(buf.data()), buf.size());
if (!comment_.empty()) {
file_.write(comment_.c_str(), comment_.size());
}
if (!file_.good()) {
return Error(ErrorCode::FILE_WRITE_ERROR,
"Failed to write end of central directory");
}
currentOffset_ += buf.size() + comment_.size();
return Error();
}
```
This keeps the ZIP64 behavior intact but makes `writeEndOfCentralDirectory` mostly high‑level decisions, with low‑level packing isolated in short helpers.
Overall, these small extractions should address the “too complex / too repetitive” feedback while preserving all current functionality and ZIP64 behavior.
</issue_to_address>
deepin pr auto review

Git Diff Code Review Report

Overall assessment: This change mainly adds ZIP64 support to the ZIP file writer to handle files larger than 4GB. The code is modeled on the Go standard library.

1. Syntax and logic review

1.1 fast_deflate.h/cpp

Strengths:
Suggested improvements:

1.2 file_task.h

Strengths:

Suggested improvements:

1.3 zip_writer.h/cpp

Strengths:

Suggested improvements:

2. Code quality

Strengths:

Suggested improvements:

3. Performance

Strengths:

Suggested improvements:

4. Security

Strengths:

Suggested improvements:

5. Other suggestions
Summary: The overall quality of this change is good, and ZIP64 support is implemented correctly. The main room for improvement lies in:

Thorough testing is recommended before merging, especially for large files and boundary cases.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dengzhongyuan365-dev, lzwind. The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
/forcemerge |
feat: Implement ZIP64 support in pzip for large file handling
Log: Improve ZIP file handling for large files with ZIP64 support
Summary by Sourcery
Add ZIP64 support to the pzip writer to correctly handle large files and central directories beyond classic ZIP size limits.
New Features:
Enhancements: